Now that we are familiar with pandas' data structures, we can turn our attention to some of the intermediate features of data frames, which include indexing and selection, operations between objects and data alignment, sorting and ranking, hierarchical indexing, missing data handling, data summarization, and writing data to files.
In this section, we will manipulate data collected from ocean-going vessels on the eastern seaboard. Vessel operations are monitored using the Automatic Identification System (AIS), a safety-at-sea navigation technology that vessels are required to maintain. AIS uses transponders to transmit very high frequency (VHF) radio signals containing static information, including ship name, call sign, and country of origin, as well as dynamic information unique to a particular voyage, such as vessel location, heading, and speed.
The International Maritime Organization's (IMO) International Convention for the Safety of Life at Sea requires functioning AIS capabilities on all vessels of 300 gross tons or greater, and the US Coast Guard requires AIS on nearly all vessels sailing in U.S. waters. The Coast Guard has established a national network of AIS receivers that provides coverage of nearly all U.S. waters. AIS signals are transmitted several times each minute, the network is capable of handling thousands of reports per minute, and updates arrive as often as every two seconds. A typical voyage in our study might therefore include the transmission of hundreds or thousands of AIS-encoded signals. This provides a rich source of data that includes both spatial and temporal information.
For our purposes, we will use summarized data that describes the transit of a given vessel through a particular administrative area. The data includes the start and end time of the transit segment, as well as information about the speed of the vessel, how far it travelled, etc.
In [ ]:
import pandas as pd
import numpy as np
vessels = pd.read_csv('../data/AIS/vessel_information.csv', index_col=0)
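Before indexing, it can help to peek at the table. A quick look at the first few rows (the exact columns shown depend on the CSV):
In [ ]:
vessels.head()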
In [ ]:
# Sample Series object
flag = vessels.flag
flag
In [ ]:
# Numpy-style indexing
flag[:10]
In [ ]:
# Indexing by label
flag[[298716,725011300]]
In a DataFrame, we can slice along either or both axes:
In [ ]:
vessels[['num_names','num_types']]
In [ ]:
vessels[vessels.max_loa>700]
The indexing attribute loc allows us to select subsets of rows and columns in an intuitive way:
In [ ]:
vessels.loc[720768000, ['names','flag', 'type']]
In [ ]:
vessels.loc[:4731, 'names']
Slicing also works with string labels, since an index has an intrinsic order, regardless of label type:
In [ ]:
vessels.loc[:310, 'flag':'loa']
In addition to using loc to select rows and columns by label, pandas also allows indexing by position using the iloc attribute. So, we can query rows and columns by absolute position, rather than by name:
In [ ]:
vessels.iloc[:5, 5:8]
In [ ]:
# Write your answer here
In [ ]:
np.random.seed(42)
normal_vals = pd.DataFrame({'x{}'.format(i):np.random.randn(100) for i in range(5)})
normal_vals.head()
In [ ]:
normal_vals.where(normal_vals>0).head()
where includes an optional other argument that accepts a scalar, tabular values, or a callable to replace values in the DataFrame that do not satisfy the condition. For example, we can use this to return the absolute values of normal_vals:
In [ ]:
normal_vals.where(normal_vals>0, other=-normal_vals).head()
Similarly, a callable can be used when we need to modify the replaced value:
In [ ]:
normal_vals.where(normal_vals>0, other=lambda y: -y*100).head()
Conversely, mask is the boolean inverse of where, replacing values where the condition is True:
In [ ]:
normal_vals.mask(normal_vals>0).head()
In [ ]:
normal_vals[(normal_vals.x1 > normal_vals.x0) & (normal_vals.x3 > normal_vals.x2)].head()
For a more concise (and readable) syntax, we can use the query method to perform selection on a DataFrame. Instead of having to type the fully-specified column, we can simply pass a string that describes what to select. The query above is then simply:
In [ ]:
normal_vals.query('(x1 > x0) & (x3 > x2)').head()
The DataFrame.index and DataFrame.columns are placed in the query namespace by default. If you want to refer to a variable in the current namespace, you can prefix the variable with @:
In [ ]:
min_loa = 700
In [ ]:
vessels.query('max_loa > @min_loa')
DataFrame and Series objects allow for several operations to take place either on a single object, or between two or more objects.
For this section, we will import a complementary dataset of transits (voyages) of each vessel. This is a much larger dataset, since ships each have multiple transits over several years.
For example, we can perform arithmetic on the elements of two objects, such as combining distances travelled across years. First, let's artificially construct two Series, consisting of segment lengths travelled in 2009 and 2010:
In [ ]:
segments = pd.read_csv('../data/AIS/transit_segments.csv', parse_dates=['st_time'])
In [ ]:
segments['year'] = segments.st_time.dt.year
segments.year.head()
In [ ]:
segments2009 = segments[segments.year==2009]
lengths2009 = pd.Series({name: segments2009[segments2009.name==name].seg_length.sum()
for name in segments2009.name.unique()})
In [ ]:
segments2010 = segments[segments.year==2010]
lengths2010 = pd.Series({name: segments2010[segments2010.name==name].seg_length.sum()
for name in segments2010.name.unique()})
We will learn a much more efficient way of performing this operation when we get to GroupBy operations a little later on.
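As a preview, a minimal sketch of the GroupBy equivalent of the constructions above (assuming the same segments table; the grouped totals should match the dictionary comprehensions):
In [ ]:
# Sum segment lengths per vessel name within each year, in a single grouped operation
segments.groupby(['year', 'name']).seg_length.sum().head()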
In [ ]:
length2009 = segments.loc[segments.year==2009, 'seg_length']
length2009.index = segments.name[segments.year==2009]
length2010 = segments.loc[segments.year==2010, 'seg_length']
length2010.index = segments.name[segments.year==2010]
Now, let's add them together, in hopes of getting 2-year travel totals:
In [ ]:
seg_lengths = lengths2009 + lengths2010
seg_lengths
Pandas' data alignment places NaN values for labels that do not overlap in the two Series. In fact, the majority of ships only have data for one of the two years.
In [ ]:
seg_lengths.notnull().mean()
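To see this alignment behaviour in isolation, here is a tiny constructed example (not part of the AIS data):
In [ ]:
# Labels 'a' and 'c' appear in only one Series each, so the result is NaN there
s1 = pd.Series([1, 2], index=['a', 'b'])
s2 = pd.Series([3, 4], index=['b', 'c'])
s1 + s2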
While we do want the operation to honor the data labels in this way, we probably do not want the missing values to be filled with NaN. We can use the add method with its fill_value argument to insert a zero for years that do not have data for a particular vessel:
In [ ]:
lengths2009.add(lengths2010, fill_value=0)
Operations can also be broadcast between rows or columns. For example, if we subtract the maximum LOA (ship length) from the max_loa column, we get the difference in size between each vessel and the largest vessel.
In [ ]:
vessels.max_loa - vessels.max_loa.max()
Or, looking at things row-wise, we can see how a particular vessel compares with the rest of the group with respect to important features:
In [ ]:
vessels[vessels.max_loa==vessels.max_loa.max()]
In [ ]:
recs = vessels[['num_names','num_loas', 'max_loa', 'num_types']]
diff = recs - recs.loc[354092000]
diff[:10]
We can also apply functions to each column or row of a DataFrame:
In [ ]:
recs.apply(np.median)
In [ ]:
def range_calc(x):
    return x.max() - x.min()
In [ ]:
recs.apply(range_calc)
Pandas objects also include methods for re-ordering data. The sort_index method sorts a DataFrame by its index labels:
In [ ]:
segments.sort_index().head()
In [ ]:
segments.sort_index(ascending=False).head()
Try sorting the columns instead of the rows, in ascending order:
In [ ]:
segments.sort_index(axis=1).head()
We can also use sort_values to sort a Series by value, rather than by label.
In [ ]:
segments.seg_length.sort_values(ascending=False).head(10)
For a DataFrame, we can sort according to the values of one or more columns using the by argument of sort_values:
In [ ]:
segments[['avg_sog','max_sog','min_sog']].sort_values(by=['max_sog', 'min_sog'],
                                                       ascending=[False, True]).head(10)
Ranking does not re-arrange data, but instead returns a Series containing the rank of each value relative to the others in the Series.
In [ ]:
segments.avg_sog.rank()
Ties are assigned the mean value of the tied ranks, which may result in decimal values.
In [ ]:
pd.Series([100,100]).rank()
Alternatively, you can break ties via one of several methods, such as by the order in which they occur in the dataset:
In [ ]:
segments.avg_sog.rank(method='first').sort_values()
Calling the DataFrame's rank method results in the ranks of all columns:
In [ ]:
segments.rank(ascending=False).head()
While the mmsi (Maritime Mobile Service Identity) field represents a unique identifier in the vessels table, it does not in the segments table.
We can use hierarchical indexing to build a unique index, however, which may be more meaningful than the ordered set of integers that are given as a default index.
In [ ]:
segments_hier = segments.set_index(['mmsi', 'name', 'year', 'transit', 'segment'])
segments_hier.head(10)
In [ ]:
np.random.randint(0, len(segments))
In [ ]:
segments.loc[np.random.randint(0, len(segments))]
This index is a MultiIndex object that consists of a sequence of tuples, the elements of which are some combination of the five columns used to create the index. Where there are multiple repeated values, Pandas does not print the repeats, making it easy to identify groups of values.
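We can inspect the underlying tuples directly; for example, the first few index entries (the values shown depend on the data):
In [ ]:
# Each entry of the MultiIndex is a tuple of (mmsi, name, year, transit, segment)
list(segments_hier.index[:5])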
In [ ]:
segments_hier.index.is_unique
Try using this hierarchical index to retrieve the first segment of the tenth transit of the Sentinel (mmsi=366766980):
In [ ]:
segments_hier.loc[(366766980, 'Sentinel', 2009, 10, 1)]
With a hierarchical index, we can select subsets of the data based on a partial index:
In [ ]:
segments_hier.loc[366766980]
Hierarchical indices can be created on either or both axes. Here is a trivial example:
In [ ]:
frame = pd.DataFrame(np.arange(12).reshape((4, 3)),
                     index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                     columns=[['Ohio', 'Ohio', 'Colorado'], ['Green', 'Red', 'Green']])
frame
If you want to get fancy, both the row and column indices themselves can be given names:
In [ ]:
frame.index.names = ['key1', 'key2']
frame.columns.names = ['state', 'color']
frame
With this, we can do all sorts of custom indexing:
In [ ]:
frame.loc['a', 'Ohio']
Try retrieving the value corresponding to b2 in Colorado:
In [ ]:
# Write your answer here
Additionally, the order of the set of indices in a hierarchical MultiIndex can be changed by swapping them pairwise:
In [ ]:
segments_hier.swaplevel('mmsi', 'name').head()
In [ ]:
segments_hier.swaplevel('mmsi', 'name').loc['Sentinel']
Alternately, one can index based on an arbitrary level using the xs (cross-section) method:
In [ ]:
segments_hier.xs('Sentinel', level='name')
Data can also be sorted by any index level, using sort_index with the level argument:
In [ ]:
segments_hier.sort_index(level='year', ascending=False).head()
The occurrence of missing data is so prevalent that it pays to use tools like pandas, which integrate missing-data handling so that it can be dealt with easily, in the manner required by the analysis at hand.
Missing data are represented in Series and DataFrame objects by the NaN floating point value. However, None is also treated as missing, since it is commonly used as such in other contexts (e.g. NumPy).
In [ ]:
foo = pd.Series([np.nan, -3, None, 'foobar'])
foo
In [ ]:
foo.isnull()
To illustrate how to deal with missing values in pandas, we will import an educational testing dataset, which reports test scores for children with hearing disabilities, along with associated covariates, several of which have missing values.
In [ ]:
testing = pd.read_csv('../data/test_scores.csv', index_col=0)
testing.head(10)
Missing values may be dropped or indexed out. By default, dropna drops entire rows in which one or more values are missing.
In [ ]:
testing.dropna().head(10)
A similar effect can be obtained via boolean indexing, although for a DataFrame this masks missing entries with NaN rather than dropping rows:
In [ ]:
testing.isnull().head()
In [ ]:
testing[testing.notnull()].head()
This can be overridden by passing the how='all' argument, which only drops a row when every field is a missing value.
In [ ]:
testing.dropna(how='all')
This can be customized further by specifying how many values need to be present before a row is dropped via the thresh argument.
In [ ]:
testing.dropna(thresh=10)
This is typically used in time series applications, where there are repeated measurements that are incomplete for some subjects.
In [ ]:
# Write your answer here
Rather than omitting missing data from an analysis, in some cases it may be suitable to fill the missing value in, either with a default value (such as zero) or a value that is either imputed or carried forward/backward from similar data points. We can do this programmatically in pandas with the fillna method.
In [ ]:
testing.fillna(0).head(10)
In [ ]:
testing.fillna({'family_inv': 0, 'mother_hs':1}).head(10)
Notice that fillna by default returns a new object with the desired filling behavior, rather than changing the Series or DataFrame in place (in general, this non-destructive behavior is what we want!).
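For illustration, a small sketch of the difference; inplace=True is a real fillna option, though reassignment is usually preferred:
In [ ]:
filled = testing.fillna(0)        # returns a new DataFrame; testing is unchanged
# testing.fillna(0, inplace=True) # would modify testing directly instead
filled.head()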
Missing values can also be interpolated, using any one of a variety of methods:
In [ ]:
testing.fillna(method='bfill').head(10)
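For example, linear interpolation on a toy numeric Series (constructed here just for illustration) fills gaps with values between the neighbouring observations:
In [ ]:
# Interior NaNs are replaced by linearly interpolated values: 1.0, 2.0, 3.0, 4.0
pd.Series([1, np.nan, np.nan, 4]).interpolate()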
We often wish to summarize data in Series or DataFrame objects, so that they can more easily be understood or compared with similar data. The NumPy package contains several functions that are useful here, but several summarization or reduction methods are built into Pandas data structures.
In [ ]:
testing.sum()
Clearly, sum is more meaningful for some columns than others. For methods like mean, for which application to string variables is not just meaningless but impossible, these columns are automatically excluded:
In [ ]:
testing.mean()
The important difference between NumPy's functions and Pandas' methods is that the latter have built-in support for handling missing data. Sometimes, however, we may not want to ignore missing values, but instead allow the NaN to propagate:
In [ ]:
testing.mean(skipna=False)
Passing axis=1 will summarize over rows instead of columns, which only makes sense in certain situations.
In [ ]:
nonenglish_nonwhite_withHS = testing[['non_english', 'mother_hs', 'non_white']].prod(axis=1)
nonenglish_nonwhite_withHS
A useful summarization that gives a quick snapshot of multiple statistics for a Series or DataFrame is describe:
In [ ]:
testing.describe()
describe can detect non-numeric data and sometimes yield useful information about it.
In [ ]:
testing.non_english.describe()
We can also calculate summary statistics across multiple columns, for example, covariance and correlation. The sample covariance is:
$$\mathrm{cov}(x,y) = \frac{1}{n-1}\sum_i (x_i - \bar{x})(y_i - \bar{y})$$
In [ ]:
testing.score.cov(testing.age_test)
In [ ]:
testing.score.corr(testing.age_test)
Try running corr on the entire testing DataFrame to see what is returned:
In [ ]:
# Write answer here
If we have a DataFrame with a hierarchical index (or indices), summary statistics can be applied with respect to any of the index levels:
In [ ]:
segments_hier.head()
In [ ]:
segments_hier.sum(level='transit').seg_length
Pandas can also write data to a variety of storage formats. For example, we can export the test scores table to a delimited text file:
In [ ]:
testing.to_csv("testing.csv")
The to_csv method writes a DataFrame to a comma-separated values (csv) file. You can specify custom delimiters (via the sep argument), how missing values are written (via the na_rep argument), whether the index is written (via the index argument), and whether the header is included (via the header argument), among other options.
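For example, a small sketch combining a few of these options (the output file name here is arbitrary):
In [ ]:
# Tab-delimited output, missing values written as 'NULL', index column omitted
testing.to_csv("testing_tab.txt", sep='\t', na_rep='NULL', index=False)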
An efficient way of storing data to disk is in binary format. Pandas supports this using Python’s built-in pickle serialization.
In [ ]:
testing.to_pickle("testing_pickle")
The complement to to_pickle is the read_pickle function, which restores the pickle to a DataFrame or Series:
In [ ]:
pd.read_pickle("testing_pickle").head()
As Wes warns in his book, it is recommended that binary storage of data via pickle only be used as a temporary storage format, in situations where speed is relevant. This is because there is no guarantee that the pickle format will not change with future versions of Python.
Reference: Wes McKinney, Python for Data Analysis (O'Reilly Media).